[SPARK-10809] [MLlib] Single-document topicDistributions method for LocalLDAModel#9484
hhbyyh wants to merge 7 commits into apache:master
|
Test build #45087 has finished for PR 9484 at commit
|
Can you please remove the doc ID? It's not necessary for a single doc, and removing it will make this more Java-friendly.
|
Test build #45202 has finished for PR 9484 at commit
|
This adds LDA to spark.ml, the Pipelines API. It follows the design doc in the JIRA: [https://issues.apache.org/jira/browse/SPARK-5565], with one major change:
* I eliminated doc IDs. These are not necessary with DataFrames since the user can add an ID column as needed.

Note: This will conflict with [#9484], but I'll try to merge [#9484] first and then rebase this PR.

CC: hhbyyh feynmanliang If you have a chance to make a pass, that'd be really helpful--thanks! Now that I'm done traveling & this PR is almost ready, I'll see about reviewing other PRs critical for 1.6.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9513 from jkbradley/lda-pipelines.

(cherry picked from commit e281b87)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
|
@hhbyyh Sorry again for the delay, but we can get this merged now.
|
@jkbradley It's quite all right. Thanks for reviewing. Update sent.
|
Test build #48895 has finished for PR 9484 at commit
|
The Scala doc for this line is not generated correctly. Can you try removing the argument and just writing [[topicDistributions]] instead?
|
Sorry for the late response. Update sent.
|
Jenkins, retest this please.
|
Test build #49109 has finished for PR 9484 at commit
|
The build is hitting many TimeoutExceptions.
|
Test build #49124 has finished for PR 9484 at commit
|
LGTM
JIRA: https://issues.apache.org/jira/browse/SPARK-10809
We could provide a single-document topicDistributions method for LocalLDAModel to allow quick queries that avoid RDD operations. Currently, the user must pass an RDD of documents even to score a single document.
This also adds some missing asserts.
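A rough sketch of the intended usage, not taken from the PR diff itself. It assumes the single-document method takes a bare term-count `Vector` with no doc ID (as requested in the review) and is named `topicDistribution` (singular), which is, to my knowledge, the name under which it shipped in Spark 2.0. The toy corpus, the vocabulary size, the app name, and the object name are all made up for illustration, and the snippet needs Spark MLlib on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{LDA, LocalLDAModel, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical driver object; the name is illustrative only.
object SingleDocTopicDistributionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lda-single-doc-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Toy corpus: (docId, term counts) over an assumed 5-word vocabulary.
      val corpus = sc.parallelize(Seq[(Long, Vector)](
        (0L, Vectors.dense(1.0, 2.0, 0.0, 0.0, 3.0)),
        (1L, Vectors.dense(0.0, 0.0, 4.0, 1.0, 0.0)),
        (2L, Vectors.dense(2.0, 1.0, 0.0, 0.0, 1.0))
      ))

      // The online optimizer produces a LocalLDAModel; use the full corpus
      // per mini-batch because the toy corpus is tiny.
      val optimizer = new OnlineLDAOptimizer().setMiniBatchFraction(1.0)
      val trained = new LDA().setK(2).setOptimizer(optimizer).run(corpus)

      trained match {
        case model: LocalLDAModel =>
          // Existing batch API: needs an RDD of (docId, vector) pairs.
          val batch = model.topicDistributions(corpus).collect()
          batch.foreach { case (id, dist) => println(s"doc $id -> $dist") }

          // Single-document query: no RDD and no doc ID
          // (assumed method name, see the note above).
          val query = Vectors.dense(0.0, 1.0, 0.0, 2.0, 0.0)
          val dist = model.topicDistribution(query)
          println(s"query -> $dist")
        case other =>
          println(s"Expected a LocalLDAModel but got ${other.getClass.getName}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

The single-document call runs on the driver, so it suits low-latency lookups; for scoring many documents at once the RDD-based method is likely still the better fit.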